Breadth-First Search Crawling Yields High-Quality Pages

نویسندگان

Marc Najork

Janet L. Wiener

چکیده

This paper examines the average page quality over time of pages downloaded during a web crawl of 328 million unique pages. We use the connectivity-based metric PageRank to measure the quality of a page. We show that traversing the web graph in breadth-first search order is a good crawling strategy, as it tends to discover high-quality pages early on in the crawl.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Focused crawling for both relevance and quality of medical information

Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful. To address problems of cost, coverage and quality, we ...

متن کامل

Profile-Based Focused Crawling for Social Media-Sharing Websites

We present a novel profile-based focused crawling system for dealing with the increasingly popular social media-sharing websites. In this system, we treat the user profiles as ranking criteria for guiding the crawling process. Furthermore, we divide a user’s profile into two parts, an internal part, which comes from the user’s own contribution, and an external part, which comes from the user’s ...

متن کامل

Evaluation of a Graph-based Topical Crawler

Topical (or, focused) crawlers have become important tools in dealing with the massiveness and dynamic nature of the World Wide Web. Guided by a data mining component that monitors and analyzes the boundary of the set of crawled pages, a focused crawler selectively seeks out pages on a pre-defined topic. Recent research indicates that both the textual content of web pages and the structural inf...

متن کامل

Lexical Profiling of Existing Web Directories to Support Fine-grained Topic-Focused Web Crawling

Topic-focused Web crawling aims to harness the potential of the Internet reliably and efficiently, producing topic specific indexes of pages within the Web. Previous work has focused on supplying suitably general descriptions of topics to generate large general indexes. In this paper we propose a method that uses lexical profiling of a corpus that consists of hierarchical structures in existing...

متن کامل

Expanding Reinforcement Learning Approaches for Efficient Crawling the Web

The amount of accessible information on World Wide Web is increasing rapidly, so that a general-purpose search engine cannot index everything on the Web. Focused crawlers have been proposed as a potential approach to overcome the coverage problem of search engines by limiting the domain of concentration of them. Focused crawling is a technique which is able to crawl particular topical portions ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2001

Breadth-First Search Crawling Yields High-Quality Pages

نویسندگان

چکیده

منابع مشابه

Focused crawling for both relevance and quality of medical information

Profile-Based Focused Crawling for Social Media-Sharing Websites

Evaluation of a Graph-based Topical Crawler

Lexical Profiling of Existing Web Directories to Support Fine-grained Topic-Focused Web Crawling

Expanding Reinforcement Learning Approaches for Efficient Crawling the Web

عنوان ژورنال:

اشتراک گذاری